Data preprocessing transforms raw data into a form suitable for a model.
Before building any machine learning model it is crucial to perform data preprocessing, so that the model learns and predicts from correct, clean inputs. Model performance depends heavily on the quality of the data fed to it during training.
It involves steps such as handling missing values, cleaning inconsistent formats, encoding categorical variables, feature scaling, and splitting the data into training and test sets.
These are some of the common steps, but which preprocessing steps are needed varies on a case-by-case basis.
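The steps above can be sketched end to end on a tiny made-up dataset (the column names and values here are purely illustrative, not from the marketing data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny made-up dataset for illustration only
toy = pd.DataFrame({
    "income": [50000.0, 62000.0, None, 48000.0],
    "education": ["PhD", "Master", "Graduation", "PhD"],
    "response": [1, 0, 0, 1],
})

# 1. Handle missing values (here: median imputation)
toy["income"] = toy["income"].fillna(toy["income"].median())

# 2. Encode the categorical feature as integers
toy["education"] = LabelEncoder().fit_transform(toy["education"])

# 3. Split first, then scale: the scaler is fitted on the training data only
features = toy[["income", "education"]]
target = toy["response"]
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    features, target, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_toy)
X_te_scaled = scaler.transform(X_te_toy)
```

Fitting the scaler only on the training split avoids leaking test-set statistics into training; the same order of operations is used later in this notebook.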
You are a marketing analyst, and the Senior Marketing Manager has told you that recent marketing campaigns have not been as effective as expected. You need to analyze the data set to understand this problem and propose data-driven solutions, answering the following questions to generate a report for management.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
df = pd.read_csv("marketing_data.csv")
df.sample(5)
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 188 | 10949 | 1963 | PhD | Divorced | $72,968.00 | 0 | 0 | 12/16/13 | 8 | 1092 | 37 | 592 | 145 | 37 | 55 | 1 | 5 | 5 | 8 | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | SP |
| 908 | 1055 | 1976 | Master | Married | $53,204.00 | 1 | 1 | 3/20/14 | 40 | 29 | 0 | 8 | 2 | 0 | 6 | 1 | 1 | 0 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | GER |
| 884 | 7079 | 1962 | Graduation | Divorced | $63,887.00 | 0 | 1 | 9/8/12 | 38 | 897 | 23 | 207 | 15 | 11 | 92 | 5 | 9 | 6 | 12 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | SA |
| 1821 | 7698 | 1976 | PhD | Married | $51,650.00 | 0 | 1 | 5/11/14 | 81 | 152 | 3 | 22 | 2 | 5 | 7 | 1 | 4 | 1 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | IND |
| 15 | 837 | 1977 | Graduation | Married | $54,809.00 | 1 | 1 | 9/11/13 | 0 | 63 | 6 | 57 | 13 | 13 | 22 | 4 | 2 | 1 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | SP |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   object
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Response             2240 non-null   int64
 26  Complain             2240 non-null   int64
 27  Country              2240 non-null   object
dtypes: int64(23), object(5)
memory usage: 490.1+ KB
df.describe()
| | ID | Year_Birth | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 |
| mean | 5592.159821 | 1968.805804 | 0.444196 | 0.506250 | 49.109375 | 303.935714 | 26.302232 | 166.950000 | 37.525446 | 27.062946 | 44.021875 | 2.325000 | 4.084821 | 2.662054 | 5.790179 | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.149107 | 0.009375 |
| std | 3246.662198 | 11.984069 | 0.538398 | 0.544538 | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | 41.280498 | 52.167439 | 1.932238 | 2.778714 | 2.923101 | 3.250958 | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.356274 | 0.096391 |
| min | 0.000000 | 1893.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2828.250000 | 1959.000000 | 0.000000 | 0.000000 | 24.000000 | 23.750000 | 1.000000 | 16.000000 | 3.000000 | 1.000000 | 9.000000 | 1.000000 | 2.000000 | 0.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5458.500000 | 1970.000000 | 0.000000 | 0.000000 | 49.000000 | 173.500000 | 8.000000 | 67.000000 | 12.000000 | 8.000000 | 24.000000 | 2.000000 | 4.000000 | 2.000000 | 5.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 8427.750000 | 1977.000000 | 1.000000 | 1.000000 | 74.000000 | 504.250000 | 33.000000 | 232.000000 | 50.000000 | 33.000000 | 56.000000 | 3.000000 | 6.000000 | 4.000000 | 8.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 11191.000000 | 1996.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | 263.000000 | 362.000000 | 15.000000 | 27.000000 | 28.000000 | 13.000000 | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
df.describe(include='O')
| | Education | Marital_Status | Income | Dt_Customer | Country |
|---|---|---|---|---|---|
| count | 2240 | 2240 | 2216 | 2240 | 2240 |
| unique | 5 | 8 | 1974 | 663 | 8 |
| top | Graduation | Married | $7,500.00 | 8/31/12 | SP |
| freq | 1127 | 864 | 12 | 12 | 1095 |
df.isnull().sum()
ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                24
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Response               0
Complain               0
Country                0
dtype: int64
df.isnull().sum()/len(df)*100
ID                     0.000000
Year_Birth             0.000000
Education              0.000000
Marital_Status         0.000000
Income                 1.071429
Kidhome                0.000000
Teenhome               0.000000
Dt_Customer            0.000000
Recency                0.000000
MntWines               0.000000
MntFruits              0.000000
MntMeatProducts        0.000000
MntFishProducts        0.000000
MntSweetProducts       0.000000
MntGoldProds           0.000000
NumDealsPurchases      0.000000
NumWebPurchases        0.000000
NumCatalogPurchases    0.000000
NumStorePurchases      0.000000
NumWebVisitsMonth      0.000000
AcceptedCmp3           0.000000
AcceptedCmp4           0.000000
AcceptedCmp5           0.000000
AcceptedCmp1           0.000000
AcceptedCmp2           0.000000
Response               0.000000
Complain               0.000000
Country                0.000000
dtype: float64
Only about 1% of the values in the Income feature are missing, which is negligible, so let's drop those rows.
## Dropping null values.
df.dropna(inplace=True)
df.shape
(2216, 28)
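Dropping is safe here because so little data is lost. When the missing fraction is larger, imputation is a common alternative; a minimal sketch on a hypothetical series standing in for the Income column:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the Income column
income = pd.Series([84835.0, 57091.0, np.nan, 67267.0, np.nan])

# Median imputation keeps every row and is robust to income outliers
income_imputed = income.fillna(income.median())
```

The median is usually preferred over the mean for income-like features, since a few very large incomes would pull the mean upward.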
df.rename(columns={' Income ':'Income'}, inplace=True)
df['Income'].unique()
array(['$84,835.00 ', '$57,091.00 ', '$67,267.00 ', ..., '$46,310.00 ',
'$65,819.00 ', '$94,871.00 '], dtype=object)
df['Income'] = df['Income'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df['Income'] = df['Income'].astype(float)
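The same cleaning can also be done in a single pass with a character-class regex; a small sketch on hypothetical raw values in the same `'$xx,xxx.00 '` format as the dataset:

```python
import pandas as pd

# Hypothetical raw values in the same '$xx,xxx.00 ' format as the dataset
raw = pd.Series(['$84,835.00 ', '$57,091.00 '])

# Strip the currency symbol, thousands separators and whitespace, then convert
clean = pd.to_numeric(raw.str.replace(r'[\$,\s]', '', regex=True))
```

`pd.to_numeric` has the added benefit of raising (or coercing to NaN with `errors='coerce'`) if any value still fails to parse.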
df.head(2)
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain | Country | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1826 | 1970 | Graduation | Divorced | 84835.0 | 0 | 0 | 6/16/14 | 0 | 189 | 104 | 379 | 111 | 189 | 218 | 1 | 4 | 4 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | SP | 52 |
| 1 | 1 | 1961 | Graduation | Single | 57091.0 | 0 | 0 | 6/15/14 | 0 | 464 | 5 | 64 | 7 | 0 | 37 | 1 | 7 | 3 | 7 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | CA | 61 |
# parse enrollment dates with an explicit format to avoid ambiguous month/day parsing
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], format='%m/%d/%y')
df['Dt_Customer'].dt.year.max()
# df['Dt_Customer'].dt.year.min()
2014
# compute age relative to 2014, the latest year present in Dt_Customer
df['Age'] = df['Year_Birth'].apply(lambda x: 2014 - x)
df['Age']
0 44
1 53
2 56
3 47
4 25
..
2235 38
2236 37
2237 38
2238 36
2239 45
Name: Age, Length: 2216, dtype: int64
df.head()
| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain | Country | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1826 | 1970 | Graduation | Divorced | 84835.0 | 0 | 0 | 2014-06-16 | 0 | 189 | 104 | 379 | 111 | 189 | 218 | 1 | 4 | 4 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | SP | 44 |
| 1 | 1 | 1961 | Graduation | Single | 57091.0 | 0 | 0 | 2014-06-15 | 0 | 464 | 5 | 64 | 7 | 0 | 37 | 1 | 7 | 3 | 7 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | CA | 53 |
| 2 | 10476 | 1958 | Graduation | Married | 67267.0 | 0 | 1 | 2014-05-13 | 0 | 134 | 11 | 59 | 15 | 2 | 30 | 1 | 3 | 2 | 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | US | 56 |
| 3 | 1386 | 1967 | Graduation | Together | 32474.0 | 1 | 1 | 2014-05-11 | 0 | 10 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | AUS | 47 |
| 4 | 5371 | 1989 | Graduation | Single | 21474.0 | 1 | 0 | 2014-04-08 | 0 | 6 | 16 | 24 | 11 | 0 | 34 | 2 | 3 | 1 | 2 | 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | SP | 25 |
# Regular expressions (regex) are a powerful language for matching text patterns
df['Total_amount_spent'] = np.sum(df.filter(regex='Mnt'), axis=1)
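To see what `filter(regex='Mnt')` selects, here is a tiny sketch on a toy frame (the toy values are made up; only the column-name convention matches the dataset):

```python
import pandas as pd

# Toy frame with column names following the dataset's 'Mnt*' convention
toy_amounts = pd.DataFrame({
    "MntWines": [10, 20],
    "MntFruits": [1, 2],
    "Recency": [5, 7],  # not an amount column, so the regex should exclude it
})

# filter(regex='Mnt') keeps only the columns whose names match the pattern
spent = toy_amounts.filter(regex="Mnt").sum(axis=1)
```

Summing with `axis=1` then gives one per-row total across the matched columns, exactly as `Total_amount_spent` is built above.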
## Percentage of the amount spent on Wines in Total_amount_spent.
df['MntWines']/df['Total_amount_spent']*100
0 15.882353
1 80.415945
2 53.386454
3 90.909091
4 6.593407
...
2235 53.991292
2236 9.090909
2237 59.870550
2238 19.305857
2239 15.677180
Length: 2216, dtype: float64
## Percentage of the amount spent on Gold Products in Total_amount_spent.
df['MntGoldProds']/df['Total_amount_spent']*100
0 18.319328
1 6.412478
2 11.952191
3 0.000000
4 37.362637
...
2235 11.320755
2236 29.090909
2237 4.530744
2238 4.555315
2239 13.358071
Length: 2216, dtype: float64
## Percentage of the amount spent on Meat Products in Total_amount_spent.
df['MntMeatProducts']/df['Total_amount_spent']*100
0 31.848739
1 11.091854
2 23.505976
3 9.090909
4 26.373626
...
2235 18.287373
2236 23.636364
2237 28.478964
2238 50.686913
2239 51.298701
Length: 2216, dtype: float64
df['TotalPurchases'] = np.sum(df.filter(regex='Purchases'),axis=1)
## Percentage of the NumDealsPurchases contribution to the TotalPurchases.
df['NumDealsPurchases']/df['TotalPurchases']*100
0 6.666667
1 5.555556
2 9.090909
3 25.000000
4 25.000000
...
2235 10.000000
2236 20.000000
2237 14.285714
2238 5.000000
2239 5.555556
Length: 2216, dtype: float64
## Percentage of the store Purchases contribution to the TotalPurchases.
df['NumStorePurchases']/df['TotalPurchases']*100
0 40.000000
1 38.888889
2 45.454545
3 50.000000
4 25.000000
...
2235 55.000000
2236 60.000000
2237 35.714286
2238 50.000000
2239 22.222222
Length: 2216, dtype: float64
## Percentage of Catalog purchases contribution to the TotalPurchases.
df['NumCatalogPurchases']/df['TotalPurchases']*100
0 26.666667
1 16.666667
2 18.181818
3 0.000000
4 12.500000
...
2235 10.000000
2236 0.000000
2237 7.142857
2238 20.000000
2239 27.777778
Length: 2216, dtype: float64
cmp = df.filter(regex='Cmp').sum()
cmp
AcceptedCmp3    163
AcceptedCmp4    164
AcceptedCmp5    162
AcceptedCmp1    142
AcceptedCmp2     30
dtype: int64
plt.pie(cmp,autopct='%0.2f',labels=cmp.index,explode=[0,0.2,0,0,0])
plt.show()
products = df.filter(regex='Mnt').sum()
products
MntWines            676083
MntFruits            58405
MntMeatProducts     370063
MntFishProducts      83405
MntSweetProducts     59896
MntGoldProds         97427
dtype: int64
plt.pie(products,autopct='%0.2f',labels=products.index,explode=[0.1,0,0,0,0.2,0])
plt.show()
customers_accepted = df[df['Response']==1]
sns.displot(customers_accepted['Age'])
plt.show()
sns.displot(customers_accepted['Country'])
plt.show()
df["Dependents"] = df["Kidhome"] + df["Teenhome"]
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.boxplot(y=df["Total_amount_spent"],x=df["Dependents"])
plt.subplot(1,2,2)
sns.boxplot(y=df["TotalPurchases"],x=df["Dependents"])
plt.show()
dff = df.drop('ID',axis=1)
plt.figure(figsize=[18,7])
# restrict to numeric columns; corr() raises on object columns in recent pandas
sns.heatmap(dff.select_dtypes('number').corr(), annot=True, cmap='viridis')
plt.show()
complain_edu = df[df['Complain']==1]
sns.countplot(x='Education', data=complain_edu)
plt.show()
sns.pairplot(df,hue='Response',x_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'],
y_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'])
<seaborn.axisgrid.PairGrid at 0x11613e1fdf0>
sns.pairplot(df,hue='Education',x_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'],
y_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'])
<seaborn.axisgrid.PairGrid at 0x11617ba4910>
sns.pairplot(df,hue='Marital_Status',x_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'],
y_vars=['Total_amount_spent','MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds'])
<seaborn.axisgrid.PairGrid at 0x1161a35d7e0>
## lets make a copy of the dataset.
df1 = df.copy()
## Dropping redundant columns; Year_Birth is dropped as well, because we created the new column 'Age' from it
df1.drop(['ID', 'Country','Dt_Customer', 'Year_Birth'], axis = 1, inplace = True)
df1.head(2)
| | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain | Age | Total_amount_spent | TotalPurchases | Dependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | Divorced | 84835.0 | 0 | 0 | 0 | 189 | 104 | 379 | 111 | 189 | 218 | 1 | 4 | 4 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 52 | 1190 | 15 | 0 |
| 1 | Graduation | Single | 57091.0 | 0 | 0 | 0 | 464 | 5 | 64 | 7 | 0 | 37 | 1 | 7 | 3 | 7 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 61 | 577 | 18 | 0 |
# making an instance of the label encoder class
le = LabelEncoder()
# label encoding all the categorical columns that have more than 2 unique values
df1['Education']=le.fit_transform(df1['Education'])
df1['Marital_Status']=le.fit_transform(df1['Marital_Status'])
df1.head(2)
| | Education | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Response | Complain | Age | Total_amount_spent | TotalPurchases | Dependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 84835.0 | 0 | 0 | 0 | 189 | 104 | 379 | 111 | 189 | 218 | 1 | 4 | 4 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 52 | 1190 | 15 | 0 |
| 1 | 2 | 4 | 57091.0 | 0 | 0 | 0 | 464 | 5 | 64 | 7 | 0 | 37 | 1 | 7 | 3 | 7 | 5 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 61 | 577 | 18 | 0 |
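One caveat: LabelEncoder assigns arbitrary integer codes, which implies an ordering that Marital_Status does not really have. One-hot encoding avoids that; a minimal sketch on a hypothetical column standing in for Marital_Status:

```python
import pandas as pd

# Hypothetical column standing in for Marital_Status
status = pd.DataFrame({"Marital_Status": ["Married", "Single", "Divorced", "Married"]})

# One-hot encoding creates a 0/1 column per category instead of an arbitrary integer order
dummies = pd.get_dummies(status, columns=["Marital_Status"])
```

Label encoding is kept here for simplicity, but for linear models like logistic regression, one-hot encoding of nominal categories is generally the safer choice.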
X = df1.drop('Response', axis=1)
y = df1['Response']
# Checking the count of records in the target column (accepted the last campaign or not: 1 vs 0)
df1["Response"].value_counts()
0    1883
1     333
Name: Response, dtype: int64
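The target is imbalanced (roughly 15% positives), so it is worth preserving that ratio in both splits and telling the model about it. A sketch on synthetic data (the features and labels below are made up to mimic the ~15% response rate, not drawn from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic features and imbalanced labels mimicking the ~15% response rate
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 850 + [1] * 150)

# stratify=y keeps the 0/1 ratio identical in the train and test splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# class_weight='balanced' up-weights the minority class during training
clf = LogisticRegression(class_weight='balanced').fit(Xd_train, yd_train)
```

Passing `stratify=y` and a fixed `random_state` to the real split below would make the results reproducible and keep the rare responders fairly represented in the test set.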
# train_test_split() is used to divide dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)
Shape of training feature: (1772, 27)
Shape of testing feature: (444, 27)
Shape of training label: (1772,)
Shape of testing label: (444,)
# declaring an object of standardscaler class
sc = StandardScaler()
# fit_transform() learns each feature's mean and standard deviation from the training data and standardizes it (mean 0, unit variance)
X_train = sc.fit_transform(X_train)
# transform() standardizes the test set using only the statistics learned from the training data
X_test = sc.transform(X_test)
# Create an instance
log_reg = LogisticRegression()
#Learning
log_reg.fit(X_train,y_train)
LogisticRegression()
# Check for prediction results
y_pred = log_reg.predict(X_test)
# Check the accuracy of the model
accuracy_score(y_test,y_pred)
0.8828828828828829
## Confusion matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is the current API
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(log_reg, X_test, y_test)
plt.show()
## Compute precision, recall and F1-score
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
precision recall f1-score support
0 0.89 0.98 0.93 371
1 0.82 0.37 0.51 73
accuracy 0.88 444
macro avg 0.85 0.68 0.72 444
weighted avg 0.88 0.88 0.86 444
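The report shows low recall (0.37) for the responder class: most actual responders are missed. One lever is the decision threshold; a sketch on a synthetic stand-in dataset (generated with `make_classification`, not the campaign data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic stand-in dataset with ~15% positives, like the Response column
X_syn, y_syn = make_classification(n_samples=1000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# predict() applies a fixed 0.5 threshold to predict_proba(); lowering the
# threshold flags more customers as likely responders, trading precision for recall
proba = model.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, proba >= 0.5)
recall_lower = recall_score(y_te, proba >= 0.3)
```

For a campaign, a lower threshold may be justified: contacting a few extra non-responders is usually cheaper than missing real responders.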
# roc_curve() returns the false positive rates, true positive rates and thresholds
# pass the actual target values and the predicted probabilities of the positive class
# (not the hard 0/1 predictions, which would give only a single operating point)
from sklearn.metrics import roc_curve
from sklearn import metrics
y_pred_prob = log_reg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# plot the ROC curve
plt.plot(fpr, tpr)
# set limits for x and y axes
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
# plot the diagonal representing a random (no-skill) classifier
plt.plot([0, 1], [0, 1],'r--')
# add plot and axes labels
# set text size using 'fontsize'
plt.title('ROC curve for Marketing Campaign Response Classifier (Full Model)', fontsize = 15)
plt.xlabel('False positive rate (1-Specificity)', fontsize = 15)
plt.ylabel('True positive rate (Sensitivity)', fontsize = 15)
# add the AUC score to the plot
# 'x' and 'y' give the position of the text, 's' is the text
# use round() to round off the AUC score to 4 digits
auc = metrics.roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1])
plt.text(x = 0.02, y = 0.9, s = f'AUC Score: {round(auc, 4)}')
# plot the grid
plt.grid(True)
plt.show()